Thanks to my Data Science Professor Hector Corida Bravo and his notes. These helped me immensely for this project.
Also, please forgive my creative spelling. I grew up with spellcheck, and RStudio is running slow enough without an extra pugin running.
A popular source of argument among friends, enemies, frenemies, and family members (which could be in any of the previous categories) is the state of gun ownership and gun violence in the United States today. I know I have (cordially of course) discussed this issue a fair bit, especially after the Vegas and Parkland shootings. A common train of disagreement that tends to come from these conversations is the issue of regulation. Some people associated with a particular political leaning argue that more regulation leads to fewer guns which leads to less gun violence. Other people with the opposite political inclination argue that the regulations would only be followed by people who wouldn’t be committing gun crimes anyway, and the regulations would just give wrongdoers more incentive to do wrong without the fear of an armed bystander. However, rather than presenting unbiased information to the public, news sources have decided to join one side of the argument or the other and sensationalize the issues to make more money. On the regulation-is-good side of the argument (henceforth referred to as pro-regulation), there is this article and this more scholarly article from Stanford. On the other, regulation-is-bad side of the argument (henceforth the anti-regulation stance) is this article. Because, as mentioned before, news is sensationalist and we can’t believe anything they say, we will do the data ourselves. We will be getting our information on gun crime from this Kaggle data set and cross referencing it with this data set on gun laws. We will play with weights based on issues that matter to us using this supporting table. Finally, we will adjust our gun crime rates for population using a dataset from the census bureau. After we store this data in R and make it pretty and clean, we will visualize it with multiple charts, perform some regression work, and even use ML to predict future gun violence rates. Of course, I don’t know where this data came from exactly, it could come from people who are as bad as the news sources. Also, we must understand that there are several factors at work in an issue like this one, and just because a correlation was found doesn’t mean there is causation.
In terms of hypothesis testing, our null hypothesis is that there is no relation between gun violence and gun regulation, and our alternative hypothesis is that there is a relation between gun violence and gun regulation. More general hypothesis test information can be found here.
As a final thought, I wrote the preceding without doing any data analysis, so I will be just as surprised by the results of these procedures when I finish running them as you will be when you read this for the first time!
We will be doing this project in the language R in the RStudio. You can download RStudio here. If you want to get some more background information on R and RStudio, visit this page, which has links to various resources for familiarizing yourself with R. I will try to be very thorough here too, however.
Once you have R installed and have familiarized yourself with it to the extent you desire, sign up for a Kaggle account (I used my burner Google account and it was quite easy) and download the Kaggle data set and the data set on gun laws as CSV files. The Kaggle will download as a zipped folder with a CSV in it, you have to choose CSV from a list of options for the other data. The gun laws will default to only 2017, use the tools to the left to select years 2013-2017, then click the ‘CSV’ button. Click on this link to download the census dataset. You can open these files in Microsoft Excell or another spreadsheet program to inspect the data or look at the raw CSVs in a text editor to get a feel for what R will actually be seeing.
Once you have RStudio open, got to File->New Project and create a directory for this project. This is what I will call the project’s home directory, and all associated files for the project should be put in this folder.
We will load in the CSV for the Kaggle gun violence data first. If you haven’t already put the CSV in the project’s home directory, do so.
#This library has everything and is fantastic, I've used it for
#every R project I've done
library(tidyverse)
#Pulls the data form our CSV and stores it as R's
#data structure, a data frame
violence_full <- read.csv("gun-violence-data_01-2013_03-2018.csv")
#this pipeline selects the first five colums and the
#first six rows of the dataset and prints.
violence_full %>% select(1:5)%>%head()
## incident_id date state city_or_county
## 1 461105 2013-01-01 Pennsylvania Mckeesport
## 2 460726 2013-01-01 California Hawthorne
## 3 478855 2013-01-01 Ohio Lorain
## 4 478925 2013-01-05 Colorado Aurora
## 5 478959 2013-01-07 North Carolina Greensboro
## 6 478948 2013-01-07 Oklahoma Tulsa
## address
## 1 1506 Versailles Avenue and Coursin Street
## 2 13500 block of Cerise Avenue
## 3 1776 East 28th Street
## 4 16000 block of East Ithaca Place
## 5 307 Mourning Dove Terrace
## 6 6000 block of South Owasso
As a note, all code will be commented with a max of two lines, and if I want to expound on my notes I will do it after the code block.
For example, the last line is what is called a pipeline. The ‘%>%’ operator takes the result of the thing in front of it and passes it as the first argument of the thing behind it. These pipes usually start with a dataset, in this case violence_full. Select picks columns, and this range means columns 1-5 inclusive. Head selects the first few rows of a dataset. If a line does nothing but create a value or mention a value without storing it, that value is printed.
Also, if you have not already installed a library, go to the packages tab (by default in the bottom right pane in RStudio), click the “install” button, type in the name of the package you are looking for, and click “Install.”
Now we will add the data set on gun laws and the census dataset. The government wasn’t kind enough to provide us with a CSV, so we will have to do a slightly different procedure for the .xlsx file they do provide.
#Pulls the data form our CSV and stores it as R's
#data structure, a data frame
gunlaws_full <- read.csv("raw_data.csv")
#this pipeline selects the first five colums and the
#first six rows of the dataset and prints.
gunlaws_full %>% select(1:5)%>%head()
## state year age18longgunpossess age18longgunsale age21handgunpossess
## 1 Alabama 2013 0 0 0
## 2 Alaska 2013 0 1 0
## 3 Arizona 2013 1 0 0
## 4 Arkansas 2013 0 0 0
## 5 California 2013 0 1 0
## 6 Colorado 2013 0 0 0
#library for reading .xlsx
library(xlsx)
#the second argument in read.xlsx is the sheet index
#it can also be the sheet name
pops_full <- read.xlsx("nst-est2017-01.xlsx", 1)
pops_full %>% select(1:5)%>%head()
## table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.4...leading.dots.indicate.sub.parts.
## 1 Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2017
## 2 Geographic Area
## 3 <NA>
## 4 United States
## 5 Northeast
## 6 Midwest
## NA. NA..1 NA..2 NA..3
## 1 <NA> <NA> <NA> NA
## 2 40269 <NA> Population Estimate (as of July 1) NA
## 3 Census Estimates Base 2010 2011
## 4 308745538 308758105 309338421 311644280
## 5 55317240 55318350 55388349 55642659
## 6 66927001 66929794 66973360 67141501
You will note that the dataframe for pops_full has a lot of junk in it. If you open it and the CSV for gun laws in Excell and compare them, you will notice that the .xlsx contains a lot more data than the gun laws CSV, and that most of this data will not be needed. We’ll deal with that in a bit.
If you want more information about reading in weird datafiles, check out this link.
If you look at the website for the categories, you’ll notice that there is no option to download the data. We will have to scrape it. As outlined in that link, we will have to tell R what sort of element it is looking for to collect the data from. If you don’t want to use the chrome extension the link talks about, you will have to inspect the element and select the class that that element is part of. You will put that in the argument of html_nodes. This will return a list of matches; you will have to select the one you want. In my case I wanted the first match it found.
#the library for scraping
library(rvest)
#read in webpage.
#usually a web address
webpage <- read_html("State Firearm Laws - Use the Database Glossary Search Tool.html")
cats_full <- webpage %>%
html_nodes(".table") %>%
.[[1]] %>%
html_table(fill = TRUE)
head(cats_full)
## Code
## 1 age18longgunpossess
## 2 age18longgunsale
## 3 age21handgunpossess
## 4 age21handgunsale
## 5 age21longgunpossess
## 6 age21longgunsale
## Definition
## 1 No possession of long guns until age 18
## 2 Purchase of long guns from licensed dealers and private sellers restricted to age 18 and older
## 3 No possession of handguns until age 21
## 4 Purchase of handguns from licensed dealers and private sellers restricted to age 21 and older
## 5 No possession of long guns until age 21
## 6 Purchase of long guns from licensed dealers and private sellers restricted to age 21 and older
## Category/Subcategory
## 1 Possession regulationsAge restrictions
## 2 Buyer regulationsAge restrictions
## 3 Possession regulationsAge restrictions
## 4 Buyer regulationsAge restrictions
## 5 Possession regulationsAge restrictions
## 6 Buyer regulationsAge restrictions
One snag I ran into here is that the page is dynamically loaded, so it only worked when I right clicked on the page and selected “Save page as…” and put the result in my project’s home direcory.
As mentioned before, there is a lot of junk in the pops_full dataset. We only want the data and done of the fluff. I’ll remind you what this dataset looks like:
head(pops_full)
## table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.4...leading.dots.indicate.sub.parts.
## 1 Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2017
## 2 Geographic Area
## 3 <NA>
## 4 United States
## 5 Northeast
## 6 Midwest
## NA. NA..1 NA..2 NA..3
## 1 <NA> <NA> <NA> NA
## 2 40269 <NA> Population Estimate (as of July 1) NA
## 3 Census Estimates Base 2010 2011
## 4 308745538 308758105 309338421 311644280
## 5 55317240 55318350 55388349 55642659
## 6 66927001 66929794 66973360 67141501
## NA..4 NA..5 NA..6 NA..7 NA..8 NA..9
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 2012 2013 2014 2015 2016 2017
## 4 313993272 316234505 318622525 321039839 323405935 325719178
## 5 55860261 56047732 56203078 56296628 56359360 56470581
## 6 67318295 67534451 67720120 67839187 67978168 68179351
You’ll notice that we only want a few of the columns (called atributes): the ones with the state names and the dates. You will also notice that there are some extra rows (called entities). The following code trims them down.
#library for text manipulation
library(stringr)
#the arrow is a storage operator
pops_trimmed <- pops_full %>%
#removes attributes 2 and 3
select(-(2:3)) %>%
#removes first 8 entities
slice(-(1:8)) %>%
#removes last 6 entities
slice(-(52:58) ) %>%
#change the column names (temporarily for ease of use)
#the ticks let us use code we wouldn't usually put in a pipeline
`colnames<-`(c("State", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017")) %>%
#remove the first character in the string for State
mutate(State = str_sub(State, 2, 25)) %>%
#change 2010 (ticks because its a number) to numeric data
transform(`2010` = as.double(as.character(`2010`))) %>%
#change the column names (again...)
#try without this to see what happens.
`colnames<-`(c("State", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017"))
head(pops_trimmed)
## State 2010 2011 2012 2013 2014 2015
## 1 Alabama 4785579 4798649 4813946 4827660 4840037 4850858
## 2 Alaska 714015 722259 730825 736760 736759 737979
## 3 Arizona 6407002 6465488 6544211 6616124 6706435 6802262
## 4 Arkansas 2921737 2938640 2949208 2956780 2964800 2975626
## 5 California 37327690 37672654 38019006 38347383 38701278 39032444
## 6 Colorado 5048029 5116411 5186330 5262556 5342311 5440445
## 2016 2017
## 1 4860545 4874747
## 2 741522 739795
## 3 6908642 7016270
## 4 2988231 3004279
## 5 39296476 39536653
## 6 5530105 5607154
You will notice that I changed the state names to remove the periods, changed the attribute names, and changed the 2010 attribute to neumaric (helpful later, when we want them to be numbers: it was treating it as text). For more information on string manipulation, check this link.
However, we’re not done. We want to be able to join the violence dataset to this based on year and state eventually. This dataset as it stands is not very condusive to this, it would be better if there was a separate attribute for each state in each year. This also plays into the idea of tidy data (incidentally, this seems to be the paper Professor Bravo used for his class notes–might want to cite this Professor).
#gather(dataset, what attribute names turns into, what the data turns into,
# what not to include)
pops_tidy <- gather(pops_trimmed, year, population, -State)
head(pops_tidy)
## State year population
## 1 Alabama 2010 4785579
## 2 Alaska 2010 714015
## 3 Arizona 2010 6407002
## 4 Arkansas 2010 2921737
## 5 California 2010 37327690
## 6 Colorado 2010 5048029
Also, the violence_full dataset has more attributes than we need, and the date attribute is not seen by R as a date. To use it effectivly as a date, we would have to “date” attribute to a date attribute. This involves some fanciness, and we really don’t need the full date for this project, so we won’t bother. We will create a year attribute for use when joining with the population data for analysis instead.
violence_trimmed <- violence_full %>%
#select the 4 attributes we want
select(date, state, longitude, latitude) %>%
mutate(date = str_sub(date, 1, 4))
head(violence_trimmed)
## date state longitude latitude
## 1 2013 Pennsylvania -79.8559 40.3467
## 2 2013 California -118.3330 33.9090
## 3 2013 Ohio -82.1377 41.4455
## 4 2013 Colorado -104.8020 39.6518
## 5 2013 North Carolina -79.9569 36.1140
## 6 2013 Oklahoma -95.9768 36.2405
This dataset is really cool and I regret chopping out this much data. I am sure someone particularly motivated could do some really cool stuff with all the data this thing has. If you haven’t looked more closely at it yet, do so!
Finally, you’ll notice that the cats_full dataset has two attributes in one column. We need to split this.
head(extract(cats_full, 'Category/Subcategory', into = c("category", "subcategory"), '(.+)([A-Z].+)' ))
## Code
## 1 age18longgunpossess
## 2 age18longgunsale
## 3 age21handgunpossess
## 4 age21handgunsale
## 5 age21longgunpossess
## 6 age21longgunsale
## Definition
## 1 No possession of long guns until age 18
## 2 Purchase of long guns from licensed dealers and private sellers restricted to age 18 and older
## 3 No possession of handguns until age 21
## 4 Purchase of handguns from licensed dealers and private sellers restricted to age 21 and older
## 5 No possession of long guns until age 21
## 6 Purchase of long guns from licensed dealers and private sellers restricted to age 21 and older
## category subcategory
## 1 Possession regulations Age restrictions
## 2 Buyer regulations Age restrictions
## 3 Possession regulations Age restrictions
## 4 Buyer regulations Age restrictions
## 5 Possession regulations Age restrictions
## 6 Buyer regulations Age restrictions
This uses Regex to separate the column into two different columns based on the capital letter.
Because we are taking so long to get to our final data, and because we now have a hyper cool dataset with latitude and longitude, we are going to make cool maps! Hooray!
One thing about this is, as a tangent, I will not be going into extreme detail about this section.
Here, we will make an ultra-cool layered map, with a layer per year of data, showing the location of the reports of gun violence.
#the cool library for making maps
library(leaflet)
#get rid of entities with NA values
violence_map <- na.omit(violence_trimmed)
#start map pipeline
markermap <- leaflet(violence_map) %>%
addTiles() %>%
#get longitute of one year
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2013)%>%select(longitude)))},
#same for latitude
~{as.numeric(unlist(violence_map%>%filter(date==2013)%>%select(latitude)))},
radius = 2,
color = 'red',
#assign this layer an identifier
group="2013")%>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2014)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2014)%>%select(latitude)))},
radius = 2,
color = 'orange',
group="2014") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2015)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2015)%>%select(latitude)))},
radius = 2,
color = 'green',
group="2015") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2016)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2016)%>%select(latitude)))},
radius = 2,
color = 'blue',
group="2016") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2017)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2017)%>%select(latitude)))},
radius = 2,
color = 'purple',
group="2017") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2018)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2018)%>%select(latitude)))},
radius = 2,
color = 'black',
group="2018") %>%
addLayersControl(
#gather all the layers togeather.
overlayGroups = c("2013", "2014", "2015", "2016", "2017", "2018"),
options = layersControlOptions(collapsed = FALSE)
) %>%
#start with only the 2013 layer showing.
hideGroup("2014") %>%
hideGroup("2015") %>%
hideGroup("2016") %>%
hideGroup("2017") %>%
hideGroup("2018")
#show it!
markermap
I know that this probably isn’t the most efficient way to write it, but it works…barely. This map is really slow. I recommend when manipulating it you turn off all the layers, turning them back on when the view is where you would like it to be. Usually one would add labels to the data, but it is slow enough as it is I’m not going to push it. Seriously, when trying to make this, I had to restart RStudio about 10 times as I made it less and less fancy (I started with awesome markers, labels, and the like). It just goes to show how big this dataset is.
The map for 2013 is relativly tame. There are a lot of dots, especially in urban areas, and it was less than what I expected. However, when you turn on the filter for any other year, the map just exploads. It blew my mind how many firearm-related incidents were recorded in this dataset. Even for year 2018, which isn’t done yet, has a ton of datapoints. It is cool to zoom to your hometown or an urban center and turn the filters on one by one and see how much of this is going on. As I alluded to in the last paragraph, as flippant as I may be in the introduction paragraphs, this is a really serious issue that needs to be approached with great thought, but more importaintly, action to go with it.
Relating to this project, it makes me wonder if a lot of data is missing for 2013. The Kaggle site reports that this data has “all recorded gun violence incidents in the US between January 2013 and March 2018, inclusive,” but the organization it got its data form only has maps since 2014 and this questionable site shows no dramatic increase between 2013 and 2014. Therefore, I will only use 2014-2107, the complete years, for analysis.
You can find more about the leaflet here.
Here, we are going to finally answer the question of whether or not regulation has an impact on gun violence (in our somewhat flawed, generalized way). We are, unfortunatly, going to have to modify some data sets some more. gunlaws_full already has a total_laws column, so we don’t have to do any more processing on that beyond slicing out all of the middle rows. Population is already mostly ready from the work above. Turns out, though, that we have to remove DC from population because the gun laws dataset doesn’t include DC and remove all years except 2014-2017 inclusive becuase of the violence dataset.
However, our incident database is not finished. It is a list of incidents, we want it to become a list of all of the incidents that happen in a certain state in a certain year. This website demonstrates how to do various aggegations and tallies. A very interesting but seemingly still accurate website has some good tips for how to aggregate data like this for an old library that has been updated, but I like the name of the site so I left it in anyways.
#count separate occurences of each pair of date and state
violence_summ <- violence_trimmed %>%
count(date, state)
#select only the good data, unfortunatly have to remove DC...
violence_summ <- violence_summ %>% filter(2014 <= date & date <= 2017 & state != "District of Columbia")
head(violence_summ)
## # A tibble: 6 x 3
## date state n
## <chr> <fct> <int>
## 1 2014 Alabama 1318
## 2 2014 Alaska 146
## 3 2014 Arizona 556
## 4 2014 Arkansas 572
## 5 2014 California 3732
## 6 2014 Colorado 556
I should have mentioned before what the c() function does! It takes its arguments and makes a vector out of them. You saw this before when I was renaming the columns in the population sheet and when grouping the layers for the map.
Now that this data is in the right format, we must join the population and law counts to it. A join takes two datasets and combines them on rows that are the same. In this example, each dataset has an entity with a unique date and state. Because all of our datasets will have rows that correspond exactly, it is not neccessary to pick a specific type of join, but the types are discussed in detail here.
First though, fix up the laws and the population datasets for the final time…
gunlaws_trimmed <- gunlaws_full %>%
transform(year = as.character(year)) %>%
select(state, year, lawtotal) %>%
filter(2014 <= year & year <= 2017 & state != "District of Columbia")
pops_tidy_trimmed <- pops_tidy %>%
filter(2014 <= year & year <= 2017 & State != "District of Columbia")
combined <- violence_summ %>%
#join on two conditions
left_join(gunlaws_trimmed, by = c("state" = "state", "date" = "year")) %>%
#join again on two conditions
left_join(pops_tidy_trimmed, by = c("state" = "State", "date" = "year")) %>%
#get a incidents_per_person number
mutate(per_pop = n/population)
head(combined)
## # A tibble: 6 x 6
## date state n lawtotal population per_pop
## <chr> <chr> <int> <int> <dbl> <dbl>
## 1 2014 Alabama 1318 10 4840037 0.000272
## 2 2014 Alaska 146 4 736759 0.000198
## 3 2014 Arizona 556 11 6706435 0.0000829
## 4 2014 Arkansas 572 11 2964800 0.000193
## 5 2014 California 3732 100 38701278 0.0000964
## 6 2014 Colorado 556 30 5342311 0.000104
Now that we finally have a good dataset, we need to make sure the preconditions to linear regression are met. We will call lawtotal the independant variable as its increase is argued by some to decrease firearm incidents, we will call it the independant variable and the per-population incidence rate the dependant variable.
Let us look at our graph first (using ggplot: here is some mind-numbing information on it):
combined %>%
#assign vars
ggplot(aes(x=lawtotal, y=per_pop)) +
#make labels
labs(x = "Number of Regulation Laws", y = "Incidents per Capita", "title" = "Fire Arm Regulation vs. Firearm Incidents") +
#make a dotplot
geom_point()
This data seems like it might slightly decrease with the number of regulation laws, but this is not very clear cut. One way to try and make data more linear is to take the log of the dependant variable. We will try that.
combined %>%
#assign vars
ggplot(aes(x=lawtotal, y=log(per_pop))) +
#make labels
labs(x = "Number of Regulation Laws", y = "Log Incidents per Capita", "title" = "Fire Arm Regulation vs. Log Firearm Incidents") +
#make a dotplot
geom_point()
This has centered the data in the graph, but not made it much better. In fact, I would say that it has been made worse.
Reverting back to the original, not-log’d data, we will generate some more graphs to try and see if these conditions are met. First, we have to make a regression model and then feed it into the plot function.
#lm's formula is response[dependant]~terms[indepenent]
fit = lm(per_pop~lawtotal, data=combined)
#the which part selects which plots to show
plot(fit, which=1:2)
Also, we can use some functions from the broom library to get some stats on the regression.
library(broom)
glance(fit)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.02851464 0.02360814 8.895388e-05 5.811614 0.01683451 2 1582.696
## AIC BIC deviance df.residual
## 1 -3159.392 -3149.497 1.566733e-06 198
tidy(fit)
## term estimate std.error statistic p.value
## 1 (Intercept) 2.096411e-04 9.073062e-06 23.105876 4.082934e-58
## 2 lawtotal -5.844784e-07 2.424488e-07 -2.410729 1.683451e-02
After looking at this output and compairing it to the expectations of a good fit set out here, I would say that this is not a good regression line. The residuals in the “Residuals vs. Fitted” graph are all clustered to one side; with a good regression line, there would be no decernable pattern. The Normal Q-Q plot seems to show a curve: a good regression will produce normally distributed resids, which will appear as a horizontal line on that plot. The \(R^2\) value is the percentage of the dependent variables that can be explained by the dependent. In this case, it is about 3%, which is very low. A good regression would have an \(R^2\) value of about 70%. The p-value of the slope is reasonably small (about .017), but I am still wary due to the other factors mentioned.
Even though this line is not very good, we can still plot it. The dark line is the regression itself, the light blue around it is a 95% confidence interval for the line.
combined %>%
#assign vars
ggplot(aes(x=lawtotal, y=per_pop)) +
#make labels
labs(x = "Number of Regulation Laws", y = "Incidents per Capita", "title" = "Fire Arm Regulation vs. Firearm Incidents") +
#make a dotplot
geom_point()+
#add prediction
stat_smooth(method="lm")
We see it is decreasing slightly.
However, due to the inconsistancies with the requirements for linear regression listed above, I do not believe that linear regression is the best choice. As far as regression goes, linear was about all that was discussed in class, so I am going to shrug my shoulders and give up on regression. I will though show you what R thinks is the best regression for this data:
combined %>%
#assign vars
ggplot(aes(x=lawtotal, y=per_pop)) +
#make labels
labs(x = "Number of Regulation Laws", y = "Incidents per Capita", "title" = "Fire Arm Regulation vs. Firearm Incidents") +
#make a dotplot
geom_point()+
#add prediction
geom_smooth()
See how this line is all wavy? That relationship is not very linear.
Again, since this is a bit of a tangent, I will not be as explicit about my process as I was before. Here, we will attempt to train a decision tree and a random forest to predict whether gun violence will go up or down. A decision tree branches based on the value of a parameter with certain thesholds that are decided by iterations of training data. A random forest is multiple decision trees trained on random subsets of a dataset. To get a more accurate picture of this method’s true performance, we will perform k-fold cross validation. This will train multiple trees and RFs and average the results.
First, we will take the combined dataset form above and try to manipulate it into a form more useful for k-fold cross validation. For the reasons discussed above, we only have the years 2014-2017 to work with. This means 2017 will be our test data and we will only have the two intervals from 2014-2015 and 2015-2016 to train on. This will not be very effective, but it will at least show you the general process for doing your own example with better data. It will be interesting to see how relativly terribly the RF and its extra work will be compared to the decision tree.
#get the results for the 2016-2017 differences
#this will be used as tests for how well the predictor learned the other two gaps.
outcome_df <- combined %>%
#pick the last gap.
filter(date %in% c("2016", "2017")) %>%
select(state, date, per_pop) %>%
#create atributes for per_pop for each year
spread(date, per_pop) %>%
#find the difference in pop
mutate(diff = `2017` - `2016`) %>%
#get a binary difference
mutate(Direction = ifelse(diff>0, "up", "down")) %>%
select(state, Direction)
head(outcome_df)
## # A tibble: 6 x 2
## state Direction
## <chr> <chr>
## 1 Alabama up
## 2 Alaska down
## 3 Arizona up
## 4 Arkansas up
## 5 California up
## 6 Colorado down
predictor_df <- combined %>%
#get the dates before 2016
filter(date <= 2016) %>%
select(date, state, per_pop) %>%
spread(date, per_pop)
head(predictor_df)
## # A tibble: 6 x 4
## state `2014` `2015` `2016`
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 0.000272 0.000213 0.000269
## 2 Alaska 0.000198 0.000515 0.000600
## 3 Arizona 0.0000829 0.0000706 0.0000802
## 4 Arkansas 0.000193 0.000181 0.000241
## 5 California 0.0000964 0.0000829 0.0000920
## 6 Colorado 0.000104 0.000145 0.000147
#create offset matrix on left
matrix_1 <- predictor_df %>%
select(-state) %>%
as.matrix() %>%
.[,-1]
#create offset matrix on right
matrix_2 <- predictor_df %>%
select(-state) %>%
as.matrix() %>%
.[,-ncol(.)]
#get differences
diff_df <- (matrix_1 - matrix_2) %>%
magrittr::set_colnames(NULL) %>%
as_data_frame() %>%
mutate(state = predictor_df$state)
#put the 2016-2017 outcomes back in
final_df <- diff_df %>%
inner_join(outcome_df %>% select(state, Direction), by="state") %>%
mutate(Direction=factor(Direction, levels=c("down", "up")))
head(final_df)
## # A tibble: 6 x 4
## V1 V2 state Direction
## <dbl> <dbl> <chr> <fct>
## 1 -0.0000592 0.0000562 Alabama up
## 2 0.000317 0.0000852 Alaska down
## 3 -0.0000123 0.00000962 Arizona up
## 4 -0.0000121 0.0000601 Arkansas up
## 5 -0.0000136 0.00000919 California up
## 6 0.0000408 0.00000199 Colorado down
(Ok yes much of this is adapted from a homework. However, it still applies here.)
In the code above, two matricies are produced that are offset (ex. matrix_1 does not include the first quarter). This way, when one is suptracted from the other, a matrix of differences in affordability will be produced. This is then joined back with the states and true directions from the 2016-2017 data, which will be used to test the models’ training against.
Now that we have a dataframe of differences, we need to run our k-fold cross validation on it. I have arbitrarily picked k=5.
#end digits of an old phone number
set.seed(4352)
#for the createFolds tree and RF Functions, along with
#predictor
library(randomForest)
library(caret)
library(ROCR)
library(tree)
# create the cross-validation partition, k = # of folds
result_df <- createFolds(final_df$Direction, k=5) %>%
# fit models and gather results. This will run 5 times, one for each fold
purrr::imap(function(test_indices, fold_number) {
# split into train and test for the fold
train_df <- final_df %>%
select(-state) %>%
slice(-test_indices)
test_df <- final_df %>%
select(-state) %>%
slice(test_indices)
# fit the two models
rf <- randomForest(Direction~., data=train_df)
tr <- tree(Direction~., data=train_df)
# gather results
test_df %>%
select(observed_label = Direction) %>%
mutate(fold=fold_number) %>%
mutate(prob_positive_rf = predict(rf, newdata=test_df, type="prob")[,"up"]) %>%
# add predicted labels for rf using a 0.52 probability cutoff
mutate(predicted_label_rf = ifelse(prob_positive_rf > 0.52, "up", "down")) %>%
#WHY CAN'T THEY STANDARDIZE THIS?
mutate(prob_positive_tr = predict(tr, newdata=test_df)[,"up"]) %>%
# add predicted labels for tr using a 0.5 probability cutoff
mutate(predicted_label_tr = ifelse(prob_positive_tr > 0.5, "up", "down"))
}) %>%
# combine the five result data frames into one
purrr::reduce(bind_rows)
head(result_df)
## # A tibble: 6 x 6
## observed_label fold prob_positive_rf predicted_label_~ prob_positive_tr
## <fct> <chr> <dbl> <chr> <dbl>
## 1 up Fold1 0.322 down 0.333
## 2 down Fold1 0.338 down 0
## 3 up Fold1 0.522 up 0
## 4 up Fold1 0.412 down 0.333
## 5 up Fold1 0.500 down 0.556
## 6 up Fold1 0.224 down 0.333
## # ... with 1 more variable: predicted_label_tr <chr>
Here we have caluclated the predictions for both the tree and the RF, one for each in each of the 5 folds. Now, we need to analyse this data somehow.
error_rates <- result_df %>%
#calculate error
mutate(error_rf = observed_label != predicted_label_rf,
error_tr = observed_label != predicted_label_tr) %>%
#gather and take a mean (true(is error) = 1, false = 0)
group_by(fold)%>%
summarize(rf = mean(error_rf), tr = mean(error_tr)) %>%
tidyr::gather(model, error, -fold)
dotplot(error~model, data=error_rates, ylab="Mean Prediction Error")
error_rates %>%
#make a regression
lm(error~model, data=.) %>%
#get data on it
broom::tidy()
## term estimate std.error statistic p.value
## 1 (Intercept) 0.5002020 0.04083358 12.249773 1.832110e-06
## 2 modeltr -0.1248485 0.05774740 -2.161976 6.259637e-02
# create a list of true observed labels
labels <- split(result_df$observed_label, result_df$fold)
# now create a list of predictions for the RF and pass it to the ROCR::prediction function
predictions_rf <- split(result_df$prob_positive_rf, result_df$fold) %>% prediction(labels)
# do the same for the tree
predictions_tr <- split(result_df$prob_positive_tr, result_df$fold) %>% prediction(labels)
# compute average AUC for the RF
mean_auc_rf <- predictions_rf %>%
performance(measure="auc") %>%
# I know, this line is ugly, but that's how it is
slot("y.values") %>% unlist() %>%
mean()
# compute average AUC for the tree
mean_auc_tr <- predictions_tr %>%
performance(measure="auc") %>%
slot("y.values") %>% unlist() %>%
mean()
# plot the ROC curve for the RF
predictions_rf %>%
performance(measure="tpr", x.measure="fpr") %>%
plot(avg="threshold", col="orange", lwd=2)
# plot the ROC curve for the tree
predictions_tr %>%
performance(measure="tpr", x.measure="fpr") %>%
plot(avg="threshold", col="blue", lwd=2, add=TRUE)
# add a legend to the plot
legend("bottomright",
legend=paste(c("rf", "tree"), " AUC:", round(c(mean_auc_rf, mean_auc_tr), digits=3)),
col=c("orange", "blue"))
When looking at the graph, it seems that the random forest is consistantly better than the RF. Indeed, when looking at the output of the regression, we see that the modeltr variable (-.12) is negative. Because the intercept is the average error rate for the random forests, this negative value means that the tree’s average error rate is that much better. The p-value for the modeltr entry is almost significant at the ubiuquitous 95% percent confidence level (0.06), but it is not quite small enough to reasonably reject the null hypothesis that a random forest and a tree have the same true average error.
The areas under the curve are less than stellar. If the AUC = .5, that means that the method in question gets as many false positives as true ones: it is just as good as random. These values, both around .6, aren’t much better than that. Therefore, these classifiers do not seem to be very effective. This is probably because of the really bad training data, as I mentioned above.
Again, this is meant as a framework: once you have it you can plug your better data into it pretty easily.
Back to all that regression stuff…
At the end of the day, we were able to reject the null hypothesis that the slope of a linear regression line between the number of firearm regulations and firearm violence was zero. The new slope is slightly negative, which would please the people in the first, pro-regulation camp. However, the data set has many problems with the preconditions for regression that this line is not very trustworty, and we are still in the dark, beholden to the sensationalist media.
Along the way, we learned how to import data from CSV and .xlsx, tidy the data, represent data on an interactive graph, a little Machine Learning, and some linear regression. I know I said that we would use the data from that table we scraped to add weights to gun laws we found especially important, but I ran out of time for that. We did get to scrape that dataset though, which is a good skill to know.
Thank you, have a good summer, and tell the Professor to give credit to his sources!